Bag-of-words model

The bag-of-words model (Indonesian: model tas-kata-kata) is a simplified representation used in natural language processing and information retrieval.^[1] It is also known as the vector space model.^[2] In this model, each sentence in a document is broken into tokens and the document is represented by those tokens alone, ignoring grammar and even word order while counting how often each word occurs in the document.^[2]^[3]

Example implementation

Consider two simple text documents, D1 and D2:^[1]

D1: "The Sun is a star. Sun is beautiful."

D2: "The Moon is a satellite."

Based on these two documents, a dictionary is built that maps each distinct word to an index:

{
 "The": 1,
 "Sun": 2,
 "is": 3,
 "a": 4,
 "star": 5,
 "beautiful": 6,
 "Moon": 7,
 "satellite": 8
}

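Together the documents contain 8 distinct words, so each document is represented as an 8-element vector: [1, 2, 2, 1, 1, 1, 0, 0] for D1 and [1, 0, 1, 1, 0, 0, 1, 1] for D2. Each entry of a vector is the number of times the corresponding dictionary entry occurs in that document. For example, the second entry of the D1 vector is 2 because "Sun" appears twice in D1.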
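The construction above can be sketched in a few lines of Python. This is a minimal illustration and not a reference implementation: the function names `tokenize`, `build_vocabulary` and `to_vector` are hypothetical, and the tokenizer assumes that words are runs of ASCII letters and that casing is preserved, as in the example dictionary.

```python
import re


def tokenize(text):
    # Assumption: a token is a run of ASCII letters; punctuation is dropped
    # and casing is kept, so "The" and "the" would count as different words.
    return re.findall(r"[A-Za-z]+", text)


def build_vocabulary(documents):
    # Assign each distinct word a 1-based index in order of first appearance.
    vocabulary = {}
    for document in documents:
        for token in tokenize(document):
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary) + 1
    return vocabulary


def to_vector(text, vocabulary):
    # Count how often each vocabulary word occurs; word order is ignored.
    vector = [0] * len(vocabulary)
    for token in tokenize(text):
        index = vocabulary.get(token)
        if index is not None:  # words outside the vocabulary are skipped
            vector[index - 1] += 1
    return vector


d1 = "The Sun is a star. Sun is beautiful."
d2 = "The Moon is a satellite."

vocabulary = build_vocabulary([d1, d2])
v1 = to_vector(d1, vocabulary)
v2 = to_vector(d2, vocabulary)

print(vocabulary)
print(v1)  # [1, 2, 2, 1, 1, 1, 0, 0]
print(v2)  # [1, 0, 1, 1, 0, 0, 1, 1]
```

In practice, text is often lowercased and further normalized before counting. That step is left out here so that the output matches the dictionary and vectors given above.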

Footnotes

[1] Soumya George K, Shibily Joseph. Text Classification by Augmenting Bag of Words (BOW) Representation with Co-occurrence Feature. IOSR Journal of Computer Engineering (IOSR-JCE), Volume 16, Issue 1, Ver. V (Jan. 2014), pp. 34-38.
[2] McTear, Michael (et al.) (2016). The Conversational Interface - Talking to Smart Devices. p. 166.
[3] Saxena, D., Saritha, S. K., & Prasad, V. (2017). Survey Paper on Feature Extraction Methods in Text Categorization. International Journal of Computer Applications, 166(11).
